Hippocampus Substructure Segmentation Using Morphological Vision Transformer Learning

Background: The hippocampus plays a crucial role in memory and cognition. Because of the associated toxicity from whole brain radiotherapy, more advanced treatment planning techniques prioritize hippocampal avoidance, which depends on an accurate segmentation of the small and complexly shaped hippocampus. Purpose: To achieve accurate segmentation of the anterior and posterior regions of the hippocampus from T1 weighted (T1w) MRI images, we developed a novel model, Hippo-Net, which uses a mutually enhanced strategy. Methods: The proposed model consists of two major parts: 1) a localization model is used to detect the volume-of-interest (VOI) of hippocampus. 2) An end-to-end morphological vision transformer network is used to perform substructures segmentation within the hippocampus VOI. The substructures include the anterior and posterior regions of the hippocampus, which are defined as the hippocampus proper and parts of the subiculum. The vision transformer incorporates the dominant features extracted from MRI images, which are further improved by learning-based morphological operators. The integration of these morphological operators into the vision transformer increases the accuracy and ability to separate hippocampus structure into its two distinct substructures. A total of 260 T1w MRI datasets from Medical Segmentation Decathlon dataset were used in this study. We conducted a five-fold cross-validation on the first 200 T1w MR images and then performed a hold-out test on the remaining 60 T1w MR images with the model trained on the first 200 images. The segmentations were evaluated with two indicators, 1) multiple metrics including the Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), mean surface distance (MSD), volume difference (VD) and center-of-mass distance (COMD); 2) Volumetric Pearson correlation analysis. 
Results: In five-fold cross-validation, the DSCs were 0.900±0.029 and 0.886±0.031 for the hippocampus proper and parts of the subiculum, respectively. The MSDs were 0.426±0.115 mm and 0.401±0.100 mm for the hippocampus proper and parts of the subiculum, respectively. Conclusions: The proposed method showed great promise in automatically delineating hippocampus substructures on T1w MRI images. It may facilitate the current clinical workflow and reduce the physicians’ effort.


INTRODUCTION
The hippocampus is a pair of medial, subcortical brain structures located close to the temporal horn of the lateral ventricles, and it is an active research area because of its role in memory and neuropsychiatric disorders.[14] In radiation therapy, hippocampal avoidance whole brain radiation using volumetric modulated arc therapy (VMAT) plus the medication memantine has been shown to preserve cognitive function without compromising progression-free survival or overall survival when compared to classic whole brain radiation therapy plus memantine.[3,2] In Alzheimer's disease (AD), progression occurs from the trans-entorhinal cortex to the hippocampus, and finally to the neocortex.[2] These progression steps depend on the severity of the neurofibrillary tangles found in neuropathological studies. However, similar patterns can also be observed in the progression of brain atrophy found in MRI studies. Atrophy of the hippocampus measured from MRIs can be used as an early sign of AD progression.[1] Additionally, evidence of hippocampal atrophy as measured from MRIs can occur before the onset of clinical symptoms.[14] Therefore, accurate segmentation of the hippocampus from MRIs is a meaningful task in medical image analysis across multiple disciplines.[26] To determine whether the hippocampus is atrophic, clinicians often need to segment the bilateral hippocampi on MRI scans and analyze their shape and volume.[19,27] This task is difficult, however, due to several factors. First, the hippocampus has low contrast with the surrounding tissues on MRI scans,[25] since it is a gray matter structure. Second, the hippocampus has an irregular shape, leading to a blurred boundary in cross-sectional slices.[5] Third, the hippocampus is a small structure with limited volume compared to other structures that are routinely delineated as organs-at-risk (OARs) in radiation therapy.[8] Finally, there are large variations in the size and shape of the hippocampus across patients.[4] Therefore, accurate and automatic segmentation of the hippocampus is a challenging task. To date, manual segmentation of the hippocampus remains the standard in clinical practice.[13] However, manual segmentation is a tedious and error-prone process, which limits its application in big data and clinical practice. Thus, many efforts have been devoted to developing computer-aided diagnostic systems for automated segmentation of the hippocampus.
The existing automatic hippocampal segmentation methods can be categorized into two main types: atlas-based methods and machine learning-based methods. Atlas-based methods can be further divided, according to the number of atlases used in the segmentation process, into single-atlas-based, average-shape atlas-based, and multi-atlas-based approaches. For instance, Haller et al. first proposed the single-atlas-based approach for hippocampal segmentation.[11,10] However, single-atlas-based approaches are limited by inter-patient variations. Average-shape-based mapping approaches were proposed to overcome this limitation, but their segmentation results depend on the alignment quality of the target and average maps. Thus, a priori knowledge of medical mapping was incorporated into the multi-atlas-based segmentation approach. For example, Wang et al. proposed a robust discriminative multi-atlas label fusion approach that segments the hippocampus by building a conditional random field (CRF) model combining distance metric learning and graph cuts.[28] Wang's approach is a patch embedding multi-atlas label fusion method that uses only the relationship between the target block and the atlas block, and ignores the possibility that unrelated atlas blocks may dominate the voting process. Existing atlas-based methods consider neither the anatomical differences in the hippocampus among patients nor the correlation between atlases.
Machine learning-based methods can be further classified into traditional machine learning-based approaches and deep learning-based approaches. Traditional machine learning-based approaches mainly include support vector machines (SVM), Markov random fields (MRF), principal component analysis (PCA), etc.[15,17] For instance, Hao et al. proposed a local label learning strategy to estimate segmentation labels of target images by using SVM with image intensity and texture features.[12] However, these traditional machine learning approaches rely heavily on the quality of handcrafted features, and further suffer from slow segmentation, susceptibility to noise interference, and insufficient generalization performance.[16] Because convolutional neural network (CNN) models can automatically extract pixel feature information from images, they have been widely used in multiple medical image analysis tasks.[9] For example, CNN-based models can be used to segment the hippocampus from MRIs.[20] Qiu et al. proposed a multitask 3D U-net framework for hippocampus segmentation that minimizes the difference between the targeted binary mask and the model prediction while optimizing an auxiliary edge-prediction task.[23] Cao et al. developed a two-stage method for 3D hippocampus segmentation that localizes multi-size candidate regions and then fuses them.[6] These methods show promising results, demonstrating the potential of CNN-based models to improve the efficiency and accuracy of hippocampus segmentation. However, most existing deep learning-based methods ignore the spatial information of the hippocampus relative to the entirety of the human brain. As a result, they cannot effectively fuse the shape features and the semantic features, which leads to lower segmentation accuracy.
Hippocampal tracing began anteriorly, where the head is visible as an enclosed gray matter structure inferior to the amygdala, and continued posteriorly using the surrounding white matter or CSF as boundaries. The subiculum (posterior parts of the hippocampus) was included in the hippocampus. Delineation stopped when the wall of the ventricle was visibly contiguous with the fimbria. The subiculum occupies a portion of the para-hippocampal gyrus in the mesial temporal lobe and is a component of the medial temporal memory system.
Therefore, in this work, we aim to develop a novel deep network framework to segment the hippocampus by introducing a spatial attention mechanism to capture the spatial location information of the hippocampus relative to the brain. We also designed a cross-layer dual encoding shared decoding network to extract the semantic characteristics of the hippocampus. By combining the spatial location information and semantic characteristics of the hippocampus, we enhanced the segmentation accuracy of the hippocampus. In this study, we trained a novel morphological visual transformer learning-based hippocampus substructure segmentation method for accurate segmentation of the anterior and posterior regions of the hippocampus from T1 weighted (T1w) MR images.

Overview
Figure 1 outlines the schematic flow chart of the hippocampus multi-substructure segmentation process. The proposed network follows the same feedforward path for both training and inference. A collection of hippocampus images and multi-substructure contours was used for model training. The proposed model, named the morphological visual transformer-based network, takes the hippocampus image as input and generates the auto-contours of two substructures, which are the hippocampus proper and parts of the subiculum. The manual contours of these two substructures were used as ground truth to supervise the proposed network.
The proposed model consists of two deep learning-based subnetworks, i.e., a localization model and a segmentation model. The localization model is a hippocampus detection network that is used to detect the volume-of-interest (VOI) for both the hippocampus proper and parts of the subiculum [7] from the T1w MR image. The MR image is then cropped to the VOI before it is passed to the segmentation subnetwork, which eases the computational task. The segmentation model is implemented via an end-to-end morphological vision transformer network, which performs substructure segmentation within the hippocampus VOI. The vision transformer incorporates the dominant features extracted from MR images. The integration of the morphological operators into the vision transformer increases the ability to separate the hippocampus into two substructures.
During inference, the trained localization model first takes a hippocampus T1w MR image as input and detects the VOI of the hippocampus. Then, the image cropped to the VOI is sent to the segmentation model, i.e., the morphological visual transformer, to segment the substructures. Finally, using the coordinates detected by the localization model, the segmented contour is converted back to the original coordinates to obtain the final segmentation.
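The cropping and coordinate-restoration steps of the inference path can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in this study: the function names `crop_voi` and `paste_back`, the (center, size) box convention, and the use of array-index coordinates are all assumptions made for the example.

```python
import numpy as np

def _overlap(start, patch_shape, full_shape):
    """Slices selecting the part of a box that lies inside the full volume."""
    start = np.asarray(start, dtype=int)
    stop = start + np.asarray(patch_shape, dtype=int)
    lo = np.maximum(start, 0)
    hi = np.minimum(stop, np.asarray(full_shape, dtype=int))
    full_sl = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    patch_sl = tuple(slice(int(a - s), int(b - s)) for a, b, s in zip(lo, hi, start))
    return full_sl, patch_sl

def crop_voi(image, center, size):
    """Crop a box of `size` voxels around `center`; zero-pad outside the image."""
    size = np.asarray(size, dtype=int)
    start = np.asarray(center, dtype=int) - size // 2
    full_sl, patch_sl = _overlap(start, size, image.shape)
    patch = np.zeros(tuple(int(s) for s in size), dtype=image.dtype)
    patch[patch_sl] = image[full_sl]
    return patch, start  # `start` is recorded so the result can be mapped back

def paste_back(label_patch, start, full_shape):
    """Place a segmented patch back at its recorded position in the full volume."""
    full_sl, patch_sl = _overlap(start, label_patch.shape, full_shape)
    full = np.zeros(full_shape, dtype=label_patch.dtype)
    full[full_sl] = label_patch[patch_sl]
    return full
```

Recording the box origin at crop time is what allows the segmented patch to be returned to the original image grid without resampling.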

Localization model
The aim of the localization model is to crop the image to a VOI that covers only the hippocampus, which eases the computational task of substructure segmentation. To preserve the spatial information of the substructures, the coordinates of the detected VOI are recorded during testing. The location of the ground-truth hippocampus, which is derived from the manual contour, is used to supervise the localization model. For an MR image $I_{Img} \in \mathbb{R}^{w \times h \times d}$, where w, h and d denote the width, height and depth of $I_{Img}$, the ground-truth VOI is described by $C = [x_c, y_c, z_c, w_c, h_c, d_c]$. The localization model design is inspired by a recently developed focal modulation network, which is used in object detection.[26] The localization model includes a hierarchical contextualization, which extracts features at different hierarchical levels; a modulator, which combines the features from the different levels; and a neural network layer that estimates the location. The details of the localization model are explained as follows.
Given an input MRI $I_{Img} \in \mathbb{R}^{w \times h \times d}$, a first convolution layer produces the initial feature map $F_0$. A multiscale hierarchy of feature maps is then collected by iterating
$F_k = \mathrm{GeLU}(\mathrm{Conv}(F_{k-1}))$, (1)
where $F_{k-1}$ denotes the feature map from the previous iteration and $F_k$ is obtained by applying a convolution followed by the Gaussian error linear unit (GeLU) activation function.[13] After several iterations of Eq. (1), multi-hierarchical features are collected. These feature maps are then matched to the same size via interpolation and summed:
$F_m = \sum_k \mathrm{Interp}(F_k)$. (2)
Then, using a neural network layer, we derive an estimate of C from $F_m$, denoted $\hat{C} = [\hat{x}_c, \hat{y}_c, \hat{z}_c, \hat{w}_c, \hat{h}_c, \hat{d}_c]$. To achieve this, the loss function shown in Eq. (3) is used during training of the localization module.
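As an illustration of how the ground-truth box that supervises the localization model could be derived from a manual contour, the sketch below computes the tightest box around a binary mask. The function name `voi_from_mask`, the center-plus-size layout of the returned vector, and the use of array-axis order for the three coordinates are assumptions for this example and are not taken from the study.

```python
import numpy as np

def voi_from_mask(mask):
    """Return [xc, yc, zc, w, h, d] of the tightest box around nonzero voxels.

    Coordinates follow the array axis order (axis 0, 1, 2), in voxel units.
    """
    coords = np.argwhere(np.asarray(mask) > 0)
    if coords.size == 0:
        raise ValueError("mask contains no foreground voxels")
    lo = coords.min(axis=0)          # first occupied index per axis
    hi = coords.max(axis=0) + 1      # one past the last occupied index
    size = hi - lo
    center = lo + size / 2.0
    return np.concatenate([center, size]).astype(float)
```

In practice a margin is often added around such a box before cropping; whether and how much margin was used here is not stated in the text.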

Morphological visual transformer
For the next step, the MRI $I_{Img}$ is cropped to a VOI box whose center is defined by $\hat{C}$. This process removes regions that are unrelated to hippocampus segmentation and thus improves the efficiency of the model. To ensure the cropped image is uniformly sized for the following subnetwork, zero-padding is used. The processed image is then input into the morphological visual transformer (MVT). The MVT is built in an end-to-end fashion, meaning that the input and output share the same size. After several convolutional layers with a stride of 2, the MVT uses two auto-learned morphological operators, dilation and erosion, to process the hidden feature maps. Compared to a convolutional kernel with a stride of 2 or a max-pooling layer, which can be regarded as a dilation with a flat square structuring element followed by pooling, the learned morphological operators can be tuned to aggregate the most important information. This further reduces the redundant and meaningless information passed to the next operator, the visual transformer, and therefore improves its performance. The outputs of the two morphological operators are then concatenated and fed into a projection convolutional layer and a linear projection operator to fit them to the input of the visual transformer. A widely used visual transformer is adopted.[24] Afterwards, several deconvolutional layers are applied until the output of the MVT model is equal in size to the input.
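To make the two morphological operators concrete, the following sketch implements grayscale dilation and erosion with a non-flat structuring element `w`, using the standard max-plus and min-minus definitions. It is an assumption that the learned operators take this form; the text does not give their formula. In a trained network `w` would be a learnable weight, whereas here it is passed in as a fixed array, and the names `dilation`, `erosion` and `morphological_features` are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _windows(x, k):
    """All k x k x k neighbourhoods of a 3D map (edge-padded, k odd)."""
    xp = np.pad(x, k // 2, mode="edge")
    return sliding_window_view(xp, (k, k, k))  # shape: x.shape + (k, k, k)

def dilation(x, w):
    """Grayscale dilation: maximum of (neighbour value + structuring weight)."""
    return (_windows(x, w.shape[0]) + w).max(axis=(-3, -2, -1))

def erosion(x, w):
    """Grayscale erosion: minimum of (neighbour value - structuring weight)."""
    return (_windows(x, w.shape[0]) - w).min(axis=(-3, -2, -1))

def morphological_features(x, w_dil, w_ero):
    """Stack both responses along a new channel axis, as input to a projection."""
    return np.stack([dilation(x, w_dil), erosion(x, w_ero)], axis=0)
```

With an all-zero `w` the dilation reduces to a sliding maximum, which matches the remark above that max-pooling behaves like dilation with a flat structuring element; nonzero weights let the operator emphasize particular neighbours.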
After the MVT step, consolidation can be used to transform the segmentation back to the original coordinate system ($I_{Img}$), since the location information has been obtained from the localization model.
To supervise the MVT, a combination of two loss functions is used: a generalized cross-entropy loss $L_{GCE}$ and a generalized Dice loss $L_{GD}$. $L_{GCE}$ evaluates the difference between the predicted label and the ground-truth label at each voxel, and is defined in terms of $l_i$, the ground-truth label at voxel i, and $\hat{l}_i$, the predicted label at voxel i.
$L_{GD}$ addresses the imbalance between the number of segmented voxels (often a small portion of the whole image) and background voxels (a large portion); its definition includes a small value ϵ. The weighted sum of these two loss terms is then used to train the MVT model.
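The combination of a voxel-wise cross-entropy term and a class-balanced Dice term can be sketched as below. The exact forms used in this study are not recoverable from the text, so the sketch assumes the standard multi-class cross entropy and the commonly used generalized Dice loss with inverse squared class-volume weights; the weighting factor `lam` and all function names are hypothetical.

```python
import numpy as np

def cross_entropy_loss(probs, onehot, eps=1e-7):
    """Mean voxel-wise cross entropy; inputs have shape (n_classes, n_voxels)."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=0)))

def generalized_dice_loss(probs, onehot, eps=1e-7):
    """Dice loss with per-class weights 1 / (class volume)^2 (assumed form)."""
    weights = 1.0 / (onehot.sum(axis=1) ** 2 + eps)
    intersection = (weights * (probs * onehot).sum(axis=1)).sum()
    total = (weights * (probs + onehot).sum(axis=1)).sum()
    return float(1.0 - 2.0 * intersection / (total + eps))

def segmentation_loss(probs, onehot, lam=1.0):
    """Weighted sum of the two terms; `lam` is an illustrative weight."""
    return cross_entropy_loss(probs, onehot) + lam * generalized_dice_loss(probs, onehot)
```

The inverse squared-volume weights make small structures contribute as much as the large background class, which is the imbalance the Dice term is meant to address.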

Dataset
In total, 260 T1w MR images from the Medical Segmentation Decathlon were used in this study.[1] The Medical Segmentation Decathlon hippocampus dataset consists of T1-weighted magnetization-prepared rapid gradient echo (MPRAGE) MRIs of both healthy adults (n = 90) and adults with a non-affective psychotic disorder. The corresponding target regions of interest (ROIs) were the anterior and posterior of the hippocampus, defined as the hippocampus proper and parts of the subiculum. This dataset was selected because of the precision needed to segment such a small object in a complex surrounding environment.
We conducted a five-fold cross-validation study on the first 200 T1w MR images. Then, a hold-out test was performed on the remaining 60 images using a model trained on the first 200 images. The segmentation was evaluated with multiple quantitative metrics, including the Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), mean surface distance (MSD), volume difference (VD) and center-of-mass distance (COMD). A Bland-Altman analysis and a volumetric Pearson correlation analysis were also performed.
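For reference, three of the volumetric metrics listed above can be computed from binary masks as in the sketch below. The definitions are the common ones (overlap-based DSC, absolute volume difference, Euclidean distance between centers of mass) and are assumed here; the study does not spell out its exact formulas, and the `spacing` argument (voxel size in mm per axis) is a hypothetical parameter.

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient: 2|A and B| / (|A| + |B|)."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def volume_difference(pred, gt, spacing):
    """Absolute difference of the segmented volumes, in spacing units cubed."""
    voxel_volume = float(np.prod(spacing))
    return abs(int(np.count_nonzero(pred)) - int(np.count_nonzero(gt))) * voxel_volume

def center_of_mass_distance(pred, gt, spacing):
    """Euclidean distance between the two mask centroids, in spacing units."""
    com_pred = np.argwhere(pred).mean(axis=0) * np.asarray(spacing, float)
    com_gt = np.argwhere(gt).mean(axis=0) * np.asarray(spacing, float)
    return float(np.linalg.norm(com_pred - com_gt))
```

Each function is applied per substructure, i.e., once to the hippocampus proper masks and once to the subiculum masks.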

Implementation and evaluation
The investigated deep learning networks were designed using Python 3.6 and TensorFlow and implemented on a GeForce RTX 2080 GPU that had 12GB of memory. Optimization was performed using the Adam gradient optimizer. The learning rate was 2×10⁻⁴. With a batch size of 20 during training, GPU memory utilization was 96%. Once the network was trained, hippocampus segmentation took only 1.5 min. To demonstrate the utility of the morphological operators, an ablation study was conducted; namely, we tested the performance of the proposed method with and without the morphological operators. To further demonstrate the significance of the proposed work, we compared the proposed method with two other popular segmentation models, cascaded U-Net (CasU) [18] and the visual transformer network (VIT).[24] Comparisons were performed using the same training and testing datasets and the same computational environment.

Comparison with state-of-the-art
The visual comparison between the proposed method and the competing methods is shown in Fig. 2.
As can be seen from the first row, the proposed method shows good agreement with the ground truth, whereas the competing methods do not. In the second row, misclassification of the posterior part is observed for the cascaded U-Net. To better demonstrate the segmentation accuracy, we computed the absolute difference between the segmentation results of each method and the binary masks of the manual contours. The difference images are shown in the fourth to sixth rows. As can be seen from the fifth and sixth rows, the difference images of the two competing methods show greater error in the region where the hippocampus proper adjoins the parts of the subiculum. The linear correlation coefficient, computed between the ground-truth and segmented target volumes, is shown in Fig. 3. The linear correlation coefficient obtained using the proposed method was 0.999 and 0.993 on five-fold cross-validation and hold-out test, respectively. These values indicate good agreement between the ground truth and the proposed results, as compared to 0.989/0.983 and 0.991/0.979 obtained by the cascaded U-Net and VIT, respectively, on five-fold cross-validation/hold-out test. On the hold-out test, the VIT consistently underestimated the region, which became more pronounced for larger volumes.
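The volumetric correlation analysis described above amounts to computing a Pearson coefficient between paired ground-truth and predicted volumes. A minimal sketch using `numpy.corrcoef` follows; the helper names and the `spacing` argument are assumptions for the example, and the snippet does not reproduce any value reported in this study.

```python
import numpy as np

def mask_volume(mask, spacing):
    """Volume of a binary mask: voxel count times the volume of one voxel."""
    return float(np.count_nonzero(mask) * np.prod(spacing))

def volumetric_pearson(gt_masks, pred_masks, spacing):
    """Pearson correlation between ground-truth and predicted volumes."""
    v_gt = np.array([mask_volume(m, spacing) for m in gt_masks])
    v_pred = np.array([mask_volume(m, spacing) for m in pred_masks])
    return float(np.corrcoef(v_gt, v_pred)[0, 1])
```

Because the Pearson coefficient is insensitive to a constant offset or scale, a systematic under- or overestimation of volume can coexist with a high correlation, which is why it is reported alongside the volume difference.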
The quantitative metrics of the proposed method and the alternate methods from the 200-case cross-validation and the 60-case hold-out test are listed in Tables 1 and 2, and Tables 3 and 4, respectively. For the cross-validation experiment, the proposed model significantly outperformed the cascaded U-Net and VIT in all metrics. In five-fold cross-validation, the DSC, HD95, MSD and COMD were 0.900±0.029 and 0.886±0.031, 1.156±0.277 and 1.133±0.264, 0.426±0.115 and 0.401±0.100, and 0.491±0.300 and 0.738±0.452 for the hippocampus proper and parts of the subiculum, respectively.
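The two surface-based metrics can be illustrated with a brute-force NumPy sketch. It assumes one common convention: surface voxels are foreground voxels with at least one background 6-neighbour, HD95 is the larger of the two directed 95th-percentile distances, and MSD is the average of the two directed mean distances. Other conventions exist and the one used in this study is not specified. The pairwise distance matrix is only practical for small structures such as a cropped hippocampus VOI.

```python
import numpy as np

def surface_points(mask, spacing):
    """Physical coordinates of foreground voxels that touch the background."""
    m = np.pad(np.asarray(mask, bool), 1)      # zero border avoids wrap-around
    interior = m.copy()
    for axis in range(m.ndim):
        interior &= np.roll(m, 1, axis) & np.roll(m, -1, axis)
    surface = m & ~interior
    return (np.argwhere(surface) - 1) * np.asarray(spacing, float)

def _directed(a, b):
    """For every point in a, distance to the nearest point in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def surface_metrics(pred, gt, spacing):
    """Return (HD95, MSD) between the surfaces of two binary masks."""
    sp, sg = surface_points(pred, spacing), surface_points(gt, spacing)
    d_pg, d_gp = _directed(sp, sg), _directed(sg, sp)
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
    msd = 0.5 * (d_pg.mean() + d_gp.mean())
    return float(hd95), float(msd)
```

Using the 95th percentile rather than the maximum makes the Hausdorff distance less sensitive to a few outlying voxels.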
In the hold-out test using the external dataset, the proposed model is significantly superior to the alternate approaches, as shown in Tables 3 and 4.

Discussion
A novel hippocampus segmentation method (called MVT) is proposed by introducing a localization mechanism to aid segmentation and designing a morphological visual transformer network for substructure segmentation. The localization model is used to detect the VOI of the hippocampus. The end-to-end morphological vision transformer network is used to perform substructure segmentation within the hippocampus VOI. The substructures include the anterior and posterior regions of the hippocampus, which are defined as the hippocampus proper and parts of the subiculum. The vision transformer incorporates the dominant features extracted from MRI images and is improved by learning-based morphological operators. The morphological operators integrated into the vision transformer enhance the ability to separate the hippocampus structure into two substructures.
Due to limited computational resources, our method focused on domain incremental learning with a cropped region for analysis. We plan to test the performance of our method in a class incremental setup. Because the visual transformer, owing to its self-adapting process, contains several orders of magnitude more parameters than traditional CNNs, it is essential to investigate an effective optimization method that reduces GPU memory allocation and simplifies the overall ViT U-Net architecture.
Our MVT is a supervised method, which means it still requires accurate manual contours as training labels. Semi-supervised learning methods that can learn features from unlabeled data are now available. In a future study, we will extend the proposed method with an ensemble approach that integrates supervised and semi-supervised learning on the limited labeled data and the large-scale unlabeled MRI data, in order to improve its generalization performance.
The auto-segmentation of hippocampal substructures has significant clinical relevance. For example, in hippocampal avoidance whole brain radiation therapy (HA-WBRT),[22] current intensity modulated radiation treatment (IMRT) and arc-based VMAT techniques can reduce the dose to the hippocampus without sacrificing target coverage and homogeneity.[29] Further improvements in patient outcomes may be possible by considering the substructures separately for optimal dose sparing; however, accurate segmentation is critical. With more accurate contouring of the hippocampal substructures, it becomes possible to apply different dose constraints to these substructures in HA-WBRT,[21] allowing for better sparing of the critical part of the hippocampus.

Conclusion
We have developed a novel deep learning-based method to accurately segment the anterior and posterior of the hippocampus. Our results showed good performance in terms of DSC and VD between the segmentation result and the ground truth.

Figure 1. The workflow of the proposed morphological visual transformer learning-based hippocampus substructure segmentation.

Figure 2. A representative case of the proposed method and state-of-the-art methods. The 1st column shows MR images. The 2nd column shows the ground truth contour. The 3rd column shows the results of the proposed method. The 4th and 5th columns show the results of cascaded U-Net and VIT, respectively. The last three rows show the absolute difference between the segmented contours and the ground truth.

Figure 3. Bland-Altman analysis of the segmented volumes between ground truth (semi-log scale) against the proposed method and comparing methods. Each dot indicates a data point from the dataset for that model. Row (a) shows the results of five-fold cross-validation. Row (b) shows the results of the hold-out test. The first column shows the segmentation results of the first substructure. The second column shows the segmentation results of the second substructure.

TABLE I. Numerical results (hippocampus proper) on 5-fold cross-validation of the proposed method, cascaded U-Net and VIT, respectively.

TABLE II. Numerical results (parts of the subiculum) on 5-fold cross-validation of the proposed method, cascaded U-Net and VIT, respectively.
As compared to five-fold cross-validation, the hold-out test did slightly worse with slightly higher standard deviation, which may be caused by the training data's distribution not covering the range of cases in the hold-out test.